{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# First Classifier\n",
    "\n",
    "## Goals\n",
    "1. Build our first classifier!\n",
    "2. Choosing a machine learning algorithms\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "In lesson two we transformed our twitter sentiment data into a feature matrix.  Now we can apply virtually any machine learning algorithm to our data and the python scikit-learn package makes it really easy to try any technique we want.  The algorithms have subtle tradeoffs and the names can be really confusing.  Many people new to machine learning spend too much time looking for the perfect machine learning algorithm for their data.  In reality training data and feature extraction are almost always more important, however chosing the wrong algorithm can cause problems.  In this notebook, we're going to go over how to pick an algorithm and evaluate if it's working well.\n",
    "\n",
    "## Picking an Algorithm\n",
    "\n",
    "Scikit learn has a great flowchart for chosing an algorithm at [scikit-learn.org/stable/tutorial/machine_learning_map/](http://scikit-learn.org/stable/tutorial/machine_learning_map/).\n",
    "\n",
    "![](http://scikit-learn.org/stable/_static/ml_map.png)\n",
    "\n",
    "Let's walk through this flowchart on our data starting at the \"START\" circle in the upper right:\n",
    "1. Do we have >50 samples?  Yes.\n",
    "2. Are we predicting a category? Yes. We have four categories Positive, Negative, No Emotion and Can't tell.\n",
    "3. Do we have labeled data? Yes.\n",
    "4. Do we have <100K samples? Yes.\n",
    "5. We have arrived at LinearSVC.  \n",
    "\n",
    "If you are actually looking at the flowchart on the scikit webpage, you can click on the green box and go to the LinearSVC documentation.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Positive emotion']\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "df = pd.read_csv('../scikit/tweets.csv')\n",
    "target = df['is_there_an_emotion_directed_at_a_brand_or_product']\n",
    "text = df['tweet_text']\n",
    "\n",
    "# We need to remove the empty rows from the text before we pass into CountVectorizer\n",
    "fixed_text = text[pd.notnull(text)]\n",
    "fixed_target = target[pd.notnull(text)]\n",
    "\n",
    "# Do the feature extraction\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "count_vect = CountVectorizer()              # initialize the count vectorizer\n",
    "count_vect.fit(fixed_text)                  # set up the columns for the feature matrix\n",
    "counts = count_vect.transform(fixed_text)   # counts is the feature matrix\n",
    "\n",
    "from sklearn.svm import LinearSVC\n",
    "# Build a classifier using the LinearSVC algorithm\n",
    "clf = LinearSVC()                           # initialize our classifier\n",
    "clf.fit(counts, fixed_target)               # fit our classifier to the training data\n",
    "print(clf.predict(count_vect.transform(['i love my iphone'])))   # try making a prediction\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All classification algorithms in scikit-learn have three important functions:\n",
    "1. an initialization function where you pass in parameters (more on this later)\n",
    "2. a \"fit\" function that learns a specific classifier on the training data. \n",
    "3. a \"predict\" function that makes predictions on new data.\n",
    "\n",
    "Remember that the classifier will only work on feature vectors.  We use our count_vect object to turn our training data into features and then we use it again to turn our new data into features.  Together our count_vect obejct and our clf object work as a classifier that decide if tweets are positive or negative.\n",
    "\n",
    "Let's try some more examples!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "i hate my iphone ['Negative emotion']\n",
      "my iphone is great ['Positive emotion']\n",
      "my iphone sucks ['Negative emotion']\n",
      "i do not love my iphone ['Positive emotion']\n"
     ]
    }
   ],
   "source": [
    "print('I hate my iphone', clf.predict(count_vect.transform(['I hate my iphone'])))  \n",
    "print('my iphone is great', clf.predict(count_vect.transform(['my iphone is great'])))  \n",
    "print('my iphone sucks', clf.predict(count_vect.transform(['my iphone sucks'])))   \n",
    "print('I do not love my iphone', clf.predict(count_vect.transform(['I do not love my iphone'])))   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hm, this all looks promising, except for the last one.  Take a second to think about why our classifier might have gotten the last one wrong (think about the feature extraction processs).\n",
    "\n",
    "Try some of your own examples.  How well do you think the classifier is working?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Our second classifier\n",
    "\n",
    "Since all machine learning algorithms take in the same type of feature input, it's easy to try using a different classifier.  If we go back to the diagram at the top and we follow the \"Not Working\" line coming out of the LinearSVC box, it takes us to the question \"Text Data\"?  We are working with text data, so we find ourselves at the \"Naive Bayes\" node.\n",
    "\n",
    "Let's switch our classifier to Naive Bayes.  This is another common type of classifier which is extremely fast and easy to deploy.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Positive emotion']\n"
     ]
    }
   ],
   "source": [
    "# Build a classifier using the Naive Bayes algorithm\n",
    "\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "\n",
    "nb = MultinomialNB()\n",
    "nb.fit(counts, fixed_target)\n",
    "\n",
    "print(nb.predict(count_vect.transform(['i love my iphone'])))   # try making a prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I hate my iphone ['Negative emotion']\n",
      "my iphone is great ['Positive emotion']\n",
      "my iphone sucks ['Negative emotion']\n",
      "I do not love my iphone ['Positive emotion']\n"
     ]
    }
   ],
   "source": [
    "print('I hate my iphone', nb.predict(count_vect.transform(['I hate my iphone'])))  \n",
    "print('my iphone is great', nb.predict(count_vect.transform(['my iphone is great'])))  \n",
    "print('my iphone sucks', nb.predict(count_vect.transform(['my iphone sucks'])))   \n",
    "print('I do not love my iphone', nb.predict(count_vect.transform(['I do not love my iphone'])))   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More on choosing an algorithm\n",
    "\n",
    "One of the most popular machine learning websites, Kaggle, did a [survey](https://www.kaggle.com/surveys/2017) in 2017 asking data scients which alogrithms they used. \n",
    "![](images/kaggle-algorithms.png)\n",
    "\n",
    "The technique we used first was a type of SVM and the technique we used second was a type of Bayesian algorithm.  These are especially good algorithms for text data.\n",
    "\n",
    "But before we get too fancy, we need to put in place a framework to evaluate our algorithms.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Takeaways\n",
    "\n",
    "1. There are many types of algorithms, but they all generally have the same API, so one great way to pick an algorithm is by trial and error.\n",
    "2. Algorithms generally have similar accuracy if configured properly and given good features.\n",
    "3. Speed of training and speed of runtime are really important things to consider when choosing an algorithm.\n",
    "4. SVMs can work great for text data, but the runtime usually gets slower with more training data.\n",
    "\n",
    "## Questions\n",
    "\n",
    "1. Is there a better way we could have done the feature extraction step?\n",
    "2. What happens when we see a new word that wasn't in the training data?\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
